Abstract

Using the exome sequencing data from 697 unrelated individuals and their simulated disease phenotypes from 
Genetic Analysis Workshop 17, we develop and apply a gene-based method to identify the relationship between a 
gene with multiple rare genetic variants and a phenotype. The method is based on the Mantel test, which assesses 
the correlation between two distance matrices using a permutation procedure. Using up to 100,000 permutations to 
estimate the statistical significance in 200 replicate data sets, we found that the method had a type I error rate of 5.1% at an α
level of 0.05 and had varying power to detect genes with simulated genetic associations. FLT1 and KDR had the most
significant correlations with Q1 and were replicated 170 and 24 times, respectively, in 200 simulated data sets using a 
Bonferroni corrected p-value of 0.05 as a threshold. These results suggest that the distance correlation method can 
be used to identify genotype-phenotype association when multiple rare genetic variants in a gene are involved. 



Background 

Genome-wide association studies have successfully iden- 
tified hundreds of novel genetic loci associated with com- 
mon diseases; however, only a small portion of the 
heritability can be explained by these associated common 
variants. An alternative but not mutually exclusive 
hypothesis to account for a sizable proportion of genetic 
susceptibility to common diseases proposes a summation 
of effects of rare variants in many genes, each conferring 
an increase in relative risk. In contrast to common var- 
iants associated with small effects, rare variants located 
in a functional region (e.g., exons) are more likely to 
cause functional effects themselves. 

As a result of the low allele frequencies, traditional 
regression-based methods do not work well with the rare 
variants derived from the sequencing data. A few meth- 
ods have been developed to address this challenge by 
summarizing individual rare variants for association
analysis; they are reviewed in a Genetic Analysis Workshop
17 (GAW17) summary paper [1]. For exome sequencing
data, a convenient unit of summarizing genetic variants 
is the gene. GAW17 provides exome sequencing data 
from the 1000 Genomes Project and simulated phenoty- 
pic traits, both binary and quantitative. 

In this study, we explore the gene-based analyses to iden- 
tify genes associated with these traits by summarizing all 
rare variants within a gene. We develop a gene-based 
method by testing the correlation between the dissimilarity 
(measured as pairwise distances between subjects) of the 
trait and the genotype. We hypothesize that subject pairs
that have similar phenotypes will also have similar
genotypes within certain genes and, conversely, that subject
pairs with dissimilar phenotypes will have dissimilar genotypes.
Based on the Mantel test [2], we perform a series of ana- 
lyses to identify genotype-phenotype associations of this 
type. Using up to 100,000 permutations to compute the
empirical p-values, we were able to
identify genes that as a whole were associated with the 
simulated traits after correcting for multiple testing. We 
also examine the type I error rate and power of the Mantel- 
based method using the GAW17 simulation answers. 

Methods

Data 

In this study, we use the data set with 697 unrelated indi- 
viduals provided by GAW17 to conduct the gene-based 
analysis. The genotypes, sex, and population data of these 
individuals are from the 1000 Genomes Project [3]. Two 
hundred replicates of the trait simulation were carried out. 
The genotypes were held fixed for all 200 simulation repli- 
cates. There are 24,487 autosomal SNPs on 3,205 genes 
available. For each SNP, the name of the SNP, its chromo- 
some and base-pair position, the name of the gene in 
which it is located, whether the SNP is synonymous or 
nonsynonymous, and the minor allele and minor allele fre- 
quency (MAF) were also provided. Three quantitative 
traits (Q1, Q2, and Q4) and a binary trait (Affected, coded
0 = no and 1 = yes) were available for each replicate. The 
Age and Sex variables were fixed across all 200 replicates, 
and the smoking status covariate, Smoke, varied across the 
replicates [4]. 

Among the 24,487 SNPs, 87.2% (21,355) have MAF < 
0.05, 74% (18,131) have MAF < 0.01, and 38.5% (9,433) 
have MAF < 0.0008 (singleton). In terms of the putative 
function, 13,572 SNPs are nonsynonymous, 10,113 SNPs 
are synonymous, and 802 SNPs have unknown functional 
annotation. SNPs with MAF < 0.05 (rare SNPs) were 
included in this study. A total of 21,355 rare SNPs on 
2,881 genes were available for the primary gene-based ana- 
lysis. The nonsynonymous SNPs with MAF < 0.05 were 
analyzed separately to understand the effect of the putative 
function of SNPs. A total of 12,193 nonsynonymous rare 
SNPs on 2,015 genes were included for the secondary 
gene-based analysis. Seven hundred ten genes had only 
one nonsynonymous SNP, and the highest number of 
nonsynonymous SNPs within a gene was 151. 

Mantel test of correlation between data matrices 

The Mantel test is a statistical test of the dependence 
between the elements of two matrices [2]. Usually, the two 
matrices contain data from multiple variables obtained on 
a common sample of subjects. The rows of the two 
matrices correspond to the subjects in the same order, and 
the columns contain data on the two sets of variables. For 
n subjects with two variables X and Y, we first calculate
two distance matrices, each with n(n - 1)/2 pairwise
distances. The Mantel statistic is based on a cross-product
term:

$$Z = \sum_{i=1}^{n} \sum_{j=1}^{n} X_{ij} Y_{ij} \qquad (1)$$

where X and Y are variables measured for the subjects,
n is the number of subjects in the distance matrices, and
X_ij and Y_ij are the pairwise distances between subjects i
and j for variables X and Y. Because the elements of a
distance matrix are not independent, it is not
straightforward
forward to determine the significance level for the corre- 
lation (i.e., Mantel statistic Z) between two distance 
matrices. The Mantel test quantifies this dependence and
evaluates its significance with a permutation procedure:
the rows and columns of one distance matrix are shuffled
together, Z is recomputed for each permuted matrix, and,
over a large number of iterations, the resulting values of Z
form the null distribution against which the observed Z is
compared.
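
As a minimal sketch of the permutation procedure described above, the following Python code computes Z for two square distance matrices and estimates an empirical p-value by relabeling the subjects of one matrix. It is an illustration only and not the R implementation used in this study. The function names, the one-sided direction (a large Z is counted as extreme), and the (hits + 1)/(permutations + 1) estimator are assumptions.

```python
import random


def mantel_z(x, y):
    """Cross-product statistic: sum of x[i][j] * y[i][j] over all subject pairs."""
    n = len(x)
    return sum(x[i][j] * y[i][j] for i in range(n) for j in range(n))


def mantel_pvalue(x, y, n_perm=999, seed=0):
    """One-sided permutation p-value; a large Z is treated as extreme (assumption)."""
    rng = random.Random(seed)
    n = len(x)
    observed = mantel_z(x, y)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        # Shuffle rows and columns of y together, i.e. relabel its subjects.
        z_perm = sum(x[i][j] * y[order[i]][order[j]]
                     for i in range(n) for j in range(n))
        if z_perm >= observed:
            hits += 1
    # Count the observed labeling as one permutation so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): eight subjects whose two distance matrices coincide.
values = [0, 1, 2, 3, 4, 5, 6, 7]
dist = [[abs(a - b) for b in values] for a in values]
p_value = mantel_pvalue(dist, dist, n_perm=999, seed=1)
```

Because the two toy matrices are identical, almost no relabeling can reach the observed Z, so the estimated p-value is small. The smallest p-value this estimator can report is 1/(n_perm + 1), which is why resolving very small p-values requires many permutations.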

Although the Mantel test was initially developed to 
identify the space-time clustering in epidemiological 
data, it has been widely adopted in other fields, such as 
ecology [5,6]. The Mantel test has also been applied in 
studies of gene expression profiles and genetics of 
human diseases [7,8]. Beckmann et al. [8] demonstrated 
that the Mantel test has better power than the chi- 
square test for gene mapping using haplotype sharing as 
a measure of genetic similarity. 

Application of the Mantel test for identifying gene-based 
correlations 

For a given gene, the genetic distance between each pair
of subjects (i.e., X_ij) is calculated as the sum, over the
rare SNPs in the gene, of the absolute differences in
additive genotype coding. For a single SNP, the distance
between two homozygotes (AA and aa) is 2 and the distance
between a homozygote (AA or aa) and a heterozygote (Aa)
is 1. The genetic distance at the gene level equals the sum
of the genetic distances of the individual SNPs. For a gene
with two biallelic loci, A/a and B/b, the genetic distance
between a pair of individuals ranges from 0 (same genotype)
to 4 (AABB vs. aabb).
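
The gene-level distance can be sketched in a few lines of Python, assuming each subject's genotype at a gene is stored as a list of minor-allele counts (0, 1, or 2), one per rare SNP. The function names and the three example subjects are hypothetical.

```python
def snp_distance(g1, g2):
    """Distance at one SNP under additive coding (0, 1, or 2 minor alleles)."""
    return abs(g1 - g2)


def gene_distance(subject_a, subject_b):
    """Gene-level distance: sum of the per-SNP distances."""
    return sum(snp_distance(a, b) for a, b in zip(subject_a, subject_b))


def genetic_distance_matrix(genotypes):
    """Pairwise gene-level distances for a list of per-subject genotype vectors."""
    return [[gene_distance(a, b) for b in genotypes] for a in genotypes]


# Two biallelic loci, coded as counts of the a and b alleles (hypothetical subjects).
AABB, AaBB, aabb = [0, 0], [1, 0], [2, 2]
matrix = genetic_distance_matrix([AABB, AaBB, aabb])
```

Here the distance between AABB and aabb is 4 and the distance between AABB and AaBB is 1, matching the range given in the text.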

We calculate the distance matrices of all rare SNPs
(MAF < 0.05) and of the nonsynonymous rare SNPs separately
for each gene. The phenotypic distance (i.e., Y_ij) equals the
absolute difference of the phenotypic values for a pair of
individuals (|Y_i - Y_j|). For the quantitative trait Q1, we
calculate the phenotypic distances from both the unadjusted
measurements and the residuals after adjustment for Age,
Sex, Smoke, and population stratification. For a binary
outcome, the distance between a case subject and a control
subject is 1, and the distance among case subjects
or among control subjects is 0. Using both the genetic
and phenotypic distance matrices, we calculate the Mantel
statistic Z in Eq. (1) for each gene and trait.
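
A short Python sketch of the phenotypic distance follows. The same absolute-difference rule covers both the quantitative trait and the binary trait coded 0/1. The function name and the trait values are hypothetical, and the covariate adjustment that produces residuals is not shown.

```python
def phenotype_distance_matrix(values):
    """Pairwise absolute differences |Y_i - Y_j| between subjects."""
    return [[abs(a - b) for b in values] for a in values]


# Quantitative trait (hypothetical values) and a binary trait coded 0/1.
quantitative = phenotype_distance_matrix([1.5, 0.25, 3.0])
binary = phenotype_distance_matrix([1, 0, 1])
```

For the binary trait, the result is 1 for every case-control pair and 0 for every case-case or control-control pair.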

To estimate the statistical significance of the Mantel
statistic Z, we first run 500 permutations to compute the
empirical p-value for each gene with at least one rare
variant. For genes with a permutation p-value less than
0.002, we rerun the permutation test 100,000 times to
obtain greater precision, resolving p-values as low as 10^-5.
We calculate the type I error using quantitative trait Q4,
which has no association with the genetic variants, in 200
replicates. We also calculate the power for the nine genes
with causal SNPs of quantitative trait Q1 using three
significance thresholds (0.05, 0.001, and 0.0001). All
statistical analyses were conducted using the statistical
software R, version 2.10. We used the implementation of the
Mantel test in the R library ade4 [9].
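
The two-stage permutation scheme can be sketched as follows, assuming a caller-supplied function that runs a permutation test with a given number of permutations and returns a p-value. The names `two_stage_pvalue` and `fake_test` are hypothetical, and `fake_test` is only a stand-in used to show the control flow.

```python
def two_stage_pvalue(run_permutations, screen=500, refine=100_000, cutoff=0.002):
    """Screen with few permutations; rerun with many only if the screen p-value is small."""
    p_value = run_permutations(screen)
    if p_value < cutoff:
        p_value = run_permutations(refine)
    return p_value


# Stand-in for a real permutation test (hypothetical): records how it was called.
calls = []


def fake_test(n_perm, hits=0):
    calls.append(n_perm)
    return (hits + 1) / (n_perm + 1)


strong = two_stage_pvalue(fake_test)                      # screen p = 1/501 < 0.002
weak = two_stage_pvalue(lambda n: fake_test(n, hits=50))  # screen p = 51/501
```

The design keeps the computation manageable: most genes stop after the cheap screening pass, and only genes that reach the smallest p-value the screen can resolve are rerun at full depth.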

Results 

Because the underlying genetic model of the GAW17
simulation was available, we first examined the
type I error and power of our gene-based method. Using
Q4, which has no simulated genetic associations, we
estimated that the mean type I error rate (α level of 0.05)
of our Mantel test-based method was 0.051 in 200 replicate
data sets. The power analysis of this method is summarized
in Table 1. By combining the nonsynonymous
SNPs with MAF < 0.05 of a gene, we were able to calculate
the true-positive rate of the nine genes with simulated
genetic associations. Each gene included at least
one causal SNP with various effect sizes. At an α level of
0.05, three genes were identified with reasonable power:
100% power for FLT1, 96% power for KDR, and 78%
power for VEGFC. Considering the large number of genes
tested in this study, we also examined the power at α levels
of 0.001 and 0.0001. At these lower thresholds, only FLT1
retained high power (0.97 and 0.95), whereas the power
for KDR fell to 0.525 and 0.245 (Table 1).
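
Type I error and power are both estimated here as the fraction of replicates in which a gene is declared significant. A minimal Python sketch of that calculation follows. The function name is an assumption, and the p-values are made up for illustration and are not GAW17 results.

```python
def rejection_rate(p_values, alpha):
    """Fraction of replicates in which the gene is declared significant at alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values for one gene across ten replicates (not GAW17 results).
replicate_p = [0.0004, 0.03, 0.2, 0.00005, 0.04, 0.6, 0.0009, 0.01, 0.08, 0.002]
rates = {alpha: rejection_rate(replicate_p, alpha) for alpha in (0.05, 0.001, 0.0001)}
```

Applied to a gene with no simulated effect, the same quantity estimates the type I error rate; applied to a gene with causal SNPs, it estimates power.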

Following the procedure described in the Methods section,
we next tested the association between the genetic
dissimilarity of each gene and the phenotypic dissimilarity
represented by the distance matrices of Q1 and Affected.
After excluding all common SNPs, we conducted the
gene-based analysis using either all SNPs or only
nonsynonymous SNPs for each trait, using up to 100,000
permutations. For the quantitative trait Q1, we also
considered models with and without adjustment for covariates
(Age, Sex, Smoke, and population stratification).

The most significant association results (genes significantly
associated with the outcome more than 10 times out of
200 replicates, i.e., 5%) of the four models, three for Q1
and one for Affected, are summarized in Table 2. Using a
stringent significance threshold of a Bonferroni-corrected
p-value of 0.05 (empirical p-values of 1.74 × 10^-5 for
SNPs with MAF < 0.05 and 2.48 × 10^-5 for nonsynonymous
SNPs with MAF < 0.05), we identified four genes
associated with Q1 and one gene associated with Affected
in more than 10 of the 200 simulated data sets. Without
any adjustment for covariates, FLT1 (32 rare SNPs),
PIK3C3 (7 rare SNPs), KDR (15 rare SNPs), and PRR4
(17 rare SNPs) were significantly associated with Q1 in 49,
13, 12, and 11 of the 200 simulated data sets, respectively.
With adjustment for Age, Sex, Smoke, and the
first two principal components of all SNPs (representing
population stratification among subjects), FLT1 was
significant 39 times. When we tested the gene-based
association by considering only nonsynonymous SNPs, we
found that FLT1 (19 nonsynonymous SNPs) and KDR
(10 nonsynonymous SNPs) were significant 170 (85%)
and 24 (12%) times, respectively. For the binary outcome
Affected, no gene was significant more than 10 times
among the 200 simulated data sets using all SNPs with MAF
< 0.05. The most significant genes, MAP3K12 (17 rare
SNPs) and PIK3C2B (62 rare SNPs), were significant 10
and 5 times, respectively. When we restricted the analysis
to only nonsynonymous SNPs, we found that FLT1 (19
nonsynonymous SNPs) was significant 13 times.
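
The two significance thresholds quoted above are consistent with dividing 0.05 by the number of genes tested in each analysis (2,881 genes with rare SNPs and 2,015 genes with nonsynonymous rare SNPs, as given in the Data section). The following Python sketch reproduces that arithmetic; the function name is an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


all_rare = bonferroni_threshold(0.05, 2881)       # genes with rare SNPs
nonsynonymous = bonferroni_threshold(0.05, 2015)  # genes with nonsynonymous rare SNPs
print(f"{all_rare:.2e} {nonsynonymous:.2e}")      # prints "1.74e-05 2.48e-05"
```

Thresholds this small explain why 100,000 permutations are needed: the empirical p-value has to be resolvable down to roughly 10^-5.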

Discussion 

The exome sequencing data measure a large number of 
rare variants in which a subset may be jointly associated 
with disease phenotypes. Gene-based analyses group these
rare variants by gene and test for a relationship between
each gene as a unit and a phenotype. In this
study, we developed a gene-based Mantel test to assess 
the correlation between a phenotype and all rare variants 
of a gene. The Mantel test is capable of evaluating the 
relationship between the distance matrices of phenotype 
and genotype using a permutation process. Although this 



Table 1 Power of identifying nine genes with simulated genetic association

          Power                                Number of causal SNPs/total SNPs
Gene      α = 0.05   α = 0.001   α = 0.0001    All SNPs   MAF < 5%   MAF > 5%
ARNT      0.255      0           0             5/18       5/17       0/1
ELAVL4    0.01       0           0             2/10       2/8        0/2
FLT1      1          0.97        0.95          11/35      10/32      1/3
FLT4      0.135      0.01        0             2/10       2/10       0/0
HIF1A     0.17       0           0             4/8        4/8        0/0
HIF3A     0.09       0           0             3/21       3/17       0/4
KDR       0.955      0.525       0.245         10/16      9/15       1/1
VEGFA     0.225      0.005       0.005         1/6        1/6        0/0
VEGFC     0.775      0           0             1/1        1/1        0/0

Table 2 Most significant genes correlated with Q1 and Affected

  Gene     Chromosome   Gene start (bp)   Gene end (bp)   Gene length (bp)   Number of SNPs   Number of significant tests (a)
Q1 (SNPs with MAF < 0.05)
  FLT1     13           27774389          27967265        192877             32               49
  PIK3C3   18           37829928          37789197        126250             7                13
  KDR      4            55639406          55686519        47114              15               12
  PRR4     12           10889715          11215480        325766             17               11
Q1 (SNPs with MAF < 0.05) (b)
  FLT1     13           27774389          27967265        192877             32               39
Q1 (nonsynonymous SNPs with MAF < 0.05) (b)
  FLT1     13           27774389          27967265        192877             19               170
  KDR      4            55639406          55686519        47114              10               24
Affected (nonsynonymous SNPs with MAF < 0.05)
  FLT1     13           27774389          27967265        192877             19               13

(a) Number of significant tests out of 200 simulated data sets. The threshold of statistical significance is a Bonferroni-corrected p-value of 0.05 (1.74 × 10^-5 for
SNPs with MAF < 0.05 and 2.48 × 10^-5 for nonsynonymous SNPs with MAF < 0.05).
(b) Adjusted for Age, Sex, Smoke, and first two principal components.



method can be applied to summarize any type of genetic 
variant, including common variants within a genomic 
region, in this study we focused on identifying rare var- 
iants, which may not have sufficient power to be detected 
individually. We applied this method to the GAW17 
unrelated individuals data and identified genes with rare 
SNPs that were significantly correlated with one quanti- 
tative trait and the binary trait. Using the 200 simulated 
data sets and comparing with the underlying genetic 
models, we found that this method had an expected type 
I error rate and that the power to detect gene-level asso- 
ciation of rare variants depended on the number of cau- 
sal rare variants and their effect size. When an 
appropriate subset of SNPs, such as SNPs with low MAF 
and nonsynonymous SNPs, and an appropriate adjust- 
ment model were selected, the method had improved 
performance in detecting the associated genes. Using the 
stringent Bonferroni correction for multiple testing 
implemented in this study, we were not surprised by the 
number of false negatives. Using a less stringent
correction for multiple testing (e.g., a false discovery rate
q-value) may help to reduce the false-negative rate.
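
To illustrate what a less stringent correction could look like, the sketch below applies the Benjamini-Hochberg adjustment, one common false discovery rate procedure. It is a swapped-in example: it is related to, but not the same as, a q-value estimate, and it was not used in this study. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical gene-level p-values (not GAW17 results).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A gene would then be reported when its adjusted value falls below the chosen false discovery rate, which is typically a looser criterion than the Bonferroni threshold.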

The method we implemented here provides an alterna- 
tive way to test the relationship between a phenotype and 
all genetic variants located within a gene. The Mantel 
test is designed to identify not only the association 
between predictors and outcome but also the clustering 
of the events within the predictor-outcome space [2]. In 
addition, this framework is flexible so that any set of rare 
variants can be combined to test their joint correlation 
with a phenotype. For example, we can test the hypoth- 
esis involving all genes in a known pathway, a biological 
network, or any set of genes grouped by a proposed 
mechanism. Another advantage of our method is its cap- 
ability of handling different genetic models. Here, we 
coded the genotypes using an additive effect model and 
calculated the distance matrix of a gene. Similarly, we 



could have tested the correlation of the dominant or 
recessive effects by modifying the coding of the geno- 
types and calculating the new genetic distance matrices. 
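
Switching genetic models amounts to recoding the genotypes before the distance is computed. The Python sketch below assumes genotypes are stored as minor-allele counts and that dominance is defined with respect to the minor allele. Both assumptions, and the function names, are illustrative.

```python
def recode(genotype, model="additive"):
    """Recode minor-allele counts (0/1/2) under a genetic model."""
    if model == "additive":
        return list(genotype)
    if model == "dominant":   # carrier of at least one minor allele
        return [int(g >= 1) for g in genotype]
    if model == "recessive":  # two copies of the minor allele
        return [int(g == 2) for g in genotype]
    raise ValueError(f"unknown model: {model}")


def gene_distance(a, b, model="additive"):
    """Sum of per-SNP absolute differences after recoding."""
    return sum(abs(x - y) for x, y in zip(recode(a, model), recode(b, model)))


# Two hypothetical subjects genotyped at three rare SNPs.
first, second = [0, 1, 2], [2, 0, 0]
distances = {m: gene_distance(first, second, m)
             for m in ("additive", "dominant", "recessive")}
```

The rest of the procedure is unchanged: the recoded distance matrix simply replaces the additive one in the Mantel test.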

The Mantel test assesses a global-level relationship (i.e., 
correlation) for all variables. It cannot select the most 
influential independent variables. In this study of exome 
sequencing data, we could make inferences on the gene 
level but were limited in how much we could narrow 
down the list of causal variants under the framework of 
the Mantel test. Combining our gene-based approach with
variable selection or ordination methods may facilitate the 
process of uncovering causal variants of human disease. In 
the analyses for identifying genes with multiple rare var- 
iants jointly correlated with disease traits using the exome 
sequencing data, we believe that the Mantel test can play 
an important role in understanding the complicated 
genetic effects of rare variants. Further developments are 
needed to extend the utility of the Mantel test in whole- 
exome sequencing data and for fine mapping of causal 
variants. 